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Abstract 

This paper mainly utilizes likelihood-based tests to detect rare variants associated with a continuous phenotype under the 
framework of kernel machine learning. Both the likelihood ratio test (LRT) and the restricted likelihood ratio test (ReLRT) are 
investigated. The relationship between the kernel machine learning and the mixed effects model is discussed. By using the 
eigenvalue representation of LRT and ReLRT, their exact finite sample distributions are obtained in a simulation manner. 
Numerical studies are performed to evaluate the performance of the proposed approaches under the contexts of standard 
mixed effects model and kernel machine learning. The results have shown that the LRT and ReLRT can control the type I 
error correctly at the given a level. The LRT and ReLRT consistently outperform the SKAT, regardless of the sample size and 
the proportion of the negative causal rare variants, and suffer from fewer power reductions compared to the SKAT when 
both positive and negative effects of rare variants are present. The LRT and ReLRT performed under the context of kernel 
machine learning have slightly higher powers than those performed under the context of standard mixed effects model. We 
use the Genetic Analysis Workshop 17 exome sequencing SNP data as an illustrative example. Some interesting results are 
observed from the analysis. Finally, we give the discussion. 
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Introduction 

For next-generation sequencing data identifying rare variants 
associated with phenotypes of interest is both practically and 
theoretically important [1-3]. Here the rare variant is typically 
defined as allele with minor allele frequency (MAF) less than 1 % . 
The past few years have witnessed increasing evidence that the 
rare variants play an important role in many complex diseases and 
disorders [4—16]. There are also some other findings supporting 
the contributions of rare variants to the diseases. For example, 
according to the odds ratio (OR) distribution, it has been 
demonstrated that most rare variants have values above 2 and 
the mean OR is 3.74, while very few common variants (defined as 
MAF>1%) have values above 2 and the mean OR is 1.36 [17]. 
See also Box 1 in Cirulli and Goldstein [2]. 

However, it is a very challenging task to detect the casual rare 
variants due to their extremely low MAF. For rare variant 
association analyses the single locus methods designed for common 
variants are rather underpowered or not applicable [1,18-20], 
thus developing appropriate statistical approaches especially for 
rare variant has become an active research topic recendy. A type 
of methods has been proposed by collapsing the rare variants 
within a functional region (e.g., gene and pathway) into one 
variant and then testing this collapsed variant [21-23]. In this 
paper, those tests are referred to as the burden test since they share 



the similar reasoning of collapsing. The burden test may be limited 
because it explicitly assumes that the variants within the collapsed 
region have the same direction of effect. However, in practice both 
protective and deleterious effects exist [1,18,19,24-26]. 

More recently, Wu et al. [18] proposed the sequence kernel 
association test (SKAT) for rare variant detection. The SKAT is a 
score based variance component test originally developed by Lin 
[27] under the framework of mixed effects model [28], and has 
been widely applied to pathway or gene set analyses [29-32]. Two 
very attractive features of the SKAT are that: (I) it avoids the 
directionality of effect and consequently can enhance the statistical 
power when both protective and deleterious effects are present; (II) 
it proceeds under the framework of kernel machine learning, and 
thus can capture more complicated nonlinear relationship among 
rare variants. 

The SKAT, however, has itself shortcomings as argued by Zhan 
and Xu [16]. For SKAT, a large score value (i.e., a small p value) 
does not necessarily mean the effect of a group of rare variants is 
also great, it may be due to a lot of variants with very weak effects. 
Additionally, when examining a set of rare variants, geneticists and 
epidemiologists may need some metrics to measure their 
contribution together, like OR in logistic regression or estimated 
coefficient in linear regression in single locus association analysis. 
While the SKAT will not involve any parameter estimation, thus 
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cannot show effect differences across various sets of rare variants. 
Consequently, methods for rare variants with the capability to 
offer such information are desirable. 

Motivated by the arguments above, in this paper we adopt the 
likelihood ratio test to detect the rare variants. Both the likelihood 
ratio test (LRT) and the restricted likelihood ratio test (ReLRT) 
are investigated and are performed under the same framework of 
mixed effects model of SKAT. A great advantage of LRT and 
ReLRT is that they not only examine the effect of a group of rare 
variants but also offer an effect measurement; this value in turn 
can be used to evaluate the relative importance of rare variants. 
To our best knowledge, the likelihood-based methods for rare 
variants have been not published before, nor are investigated 
under the framework of kernel machine learning, although the 
LRT and ReLRT are particularly popular in the literature. 

In the rest of the paper, the SKAT and the burden test are first 
introduced, and then the LRT and ReLRT are discussed under 
the mixed effects model context and the kernel machine learning 
context, respectively. In this section, we will interpret how the 
kernel machine learning can be addressed with the mixed effects 
model and examine a group of rare variants via LRT and ReLRT. 
By using the eigenvalue representation of LRT and ReLRT, their 
exact finite sample distributions are obtained in a simulation 
manner. We perform extensive numerical studies to evaluate the 
performance of the proposed approaches and compare with the 
burden test and SKAT. The exome sequencing data from Genetic 
Analysis Workshop 17 (GAW17) is used as a practical application. 

Methods 

Notation 

Let X =[*!,..., denote the covariate vector of order p such as 
age, sex, smoking, and environmental exposure, and G=\gi, gi, 
g ra ] the genotype vector of order m for rare variants within a 
functional region specified a priori. In the paper, we use the 
additive genetic model, so that g=0, 1, and 2 represent the 
number of minor alleles. For example, in the GAW17 data 
[33,34], there are 16 single nucleotide polymorphisms (SNPs) 
included within the gene KDR, then the genotype can be expressed 
as G= g 2 , g\el- Let ^denote the continuous phenotype of 
interest (e.g., weight, blood pressure, and triglyceride) andjc,, i = 1, 
2, n its realization values, here n is the sample size. Suppose 
further that the phenotype T follows a normal distribution with 
variance ff 2 conditional on the covariates X and genotypes G. 

Mixed effects model 

First consider the linear mixed effects model [28,35] 

Y = P o + X0 + Gy + e, 

(1) 

e~N(0,a 2 I„), 

where fi=[fi\, Ppl are the fixed effects for covariates, /?o is the 
intercept, and I„ is an identity matrix of order n\ here y=[y\, 
y m ] are the random effects for rare genotypes, each yj,j= 1, 2, 
m is assumed to be normally distributed with mean zero and 
variance xwf , where T is a variance component and w is a 
prespecified weight related to MAF. For rare variant, w = Beta(- 
MAF; 1, 25) is recommended in Wu et al [18], which places more 
weight on rarer variant and less weight on common variant, where 
Beta is the beta density function. In the present paper we also 
follow this idea, but make a slight modification. That is, a scaled 
weight of Wj = Wj/mzx(u>) is used, where the notation max indicates 
the maximum over all the wjs. In our experience, this modification 



is necessary to avoid numerical imprecisions encountered in the 
statistical software, such as the R statistical environment [36]. 
Greven et al. [37] gave a full description regarding this issue when 
performing the restricted likelihood ratio test for zero variance 
component in the linear mixed effects model. 
Under these conditions, we can obtain 

Var(y) = TGWWG' + ff 2 I„ = (7 2 V > „ (2) 

where X = z/a' 2 , V^ = XGWWG' + I„, W is a diagonal matrix of 
order m with elements being w. Clearly testing whether or not a 
group of rare variants are collectively associated with the 
phenotype is equivalent to testing the null hypothesis H 0 : X = 0. 
Note that the classical definition of heritability is defined as tI 
(t+c 2 ), i.e., the proportion of phenotypic variance explained by a 
group of rare variants [38], then the heritability can be further 
expressed as X/(\+X). Therefore the quantity X is an analogue of 
the heritability and can be employed for measuring the relative 
impotence of different groups of rare variants. 

Sequence kernel association test (SKAT) 

According to Lee et al. [39] and Lee et al. [40], the original 
SKAT in Wu et al. [18] and the burden test can be studied within 
a unified framework if taking into account the correlation structure 
of the random effects. Suppose that the correlation structure 
among the m rare variants is R,,, which is determined by the 
pairwise correlation coefficient corr(g,, gi) = p between any variants 
j and /. The unified SKAT statistic is given as 

Q =(y-Y) GWRpWG^Y-Y), ^ 

R P =(1-P)l m +Pl m l'm, 

where Y is the predicted value under H 0 . The test in Equation (3) 
is called the optimal SKAT (SKAT-O) since it can choose the 
correlation coefficient p adaptively to maximize the power when 
all the effects are in the same direction [39,40]. 

When p = 0 (i.e., independent correlation), the SKAT-O 
reduces to the original SKAT in Wu et al. [18] and Lin [27], 
and when p = 1 (i.e., perfect correlation), the optimal SKAT 
reduces to the burden test. 

Under H 0 , Q/ollows a mixture of chi-square distributions, the p 
values for the burden test and SKAT are obtained by the Davies 
method [4 1] or other methods [42,43] . The p value for the SKAT- 
O is obtained by using a grid search strategy [39,40] . 

Likelihood ratio test (LRT) and restricted likelihood ratio 
test (ReLRT) 

When examining variance component in the mixed effects 
model, the LRT and ReLRT are a natural alternative. Note that 
the null hypothesis H 0 : X = 0 is non-standard since under H 0 X is 
on the boundary of the parameter space [44—47], and X = 0 if and 
only if T = 0. The parameter space for X is Q. = [0, °°). 

Replacing y and c 2 in model (1) with their maximum likelihood 
(ML) estimators [47], we obtain the profile log-likelihood function 
up to a constant independent of the parameters 

L(X) = - X - {n\og(YT\\^P x Y) +log|V,|}, (4) 

where 
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p,=i„-x(x'v,- 1 x) Vv,- 1 . 

The LRT statistic is defined as 



(5) 



LRT„=2[su Vken L(X)-L(X=0)], 

=su R e n { - n log (F P \ Vj: 1 F) -log | V, | +» log( FP 0 F)}, 

(6) 

where 



P 0 =I„-X(X'X) 'X. 



(7) 



Using the spectral representation [37,47—49], it can be shown that 
LRT„ is equal to the following quantity in distribution 



= SUP;. EQ <;«l0g 



(8) 



1 + 



N„(a) 



-£log(l+^)+nlog(Y'P 0 Y) 



where {/s are the eigenvalues of matrix W 1/2 G'GW 1/2 , and 

7=1 n 



1 



(9) 



where /j/s are the eigenvalues of matrix W 1/2 G'P 0 GW 1/2 , and w/s 
are independently standard normal random variables. 

The ML estimator of a is biased downward since it does not 
take into account the loss in degrees of freedom due to estimation 
of y. While the restricted maximum likelihood (REML) method 
provides an unbiased estimator for <7 2 by using a set of n - /(linearly 
independent error contrasts [50-53]. The profile restricted log- 
likelihood function up to a constant independent of the parameters 
is given as 



(10) 



, { - log|V A | - (n- j p)log(Y'P',V,- 1 P,Y) -log|x'V. 'x|}. 



simulation-based algorithm for the finite sample distributions of 
LRT„ and ReLRT„. This algorithm has been shown to be rather 
fast and accurate. The p values of the LRT and ReLRT are 
obtained by comparing the observed statistics to those simulated 
values. 

Kernel machine learning 

So far we have discussed how to detect the causal rare variants 
by using the LRT and ReLRT which are developed under the 
standard mixed effects model context. In this section, we turn to 
the recently popular kernel machine learning, explore its 
relationship with the mixed effects model, and demonstrate how 
to detect the causal rare variants in the kernel machine learning 
context via LRT and ReLRT. As we will see, there is a close 
connection between these two statistical theories, which provides a 
more flexible way for rare variant detection with kernel methods. 

Using the same notation defined before, we describe the 
relationship between the phenotype T and genotypes G and 
covariates Xvia a semi-parametric linear model [30,31] 



y, = Po + X,fi+h(G,) + e,, 



(13) 



where h is an unknown smooth function lying in a Hubert space 
Hx generated by a positive definite kernel function K [3 1 ,54] . This 
space is called reproducing kernel Hilbert space (RKHS) under 
some regularity conditions [55-58]. The kernel function K 
essentially quantifies the genomic similarity or distance of two 
subjects and can be arbitrarily chosen as long as it satisfies the 
conditions of Mercer's theorem [55,57]. Model (13) is semi- 
parametric since the covariates X are fitted parametrically while 
the genotypes G axe fitted non-parametrically. 

To avoid over-fitting, estimation of h can be performed by 
maximizing the penalized log-likelihood function [31,59] 



L„ K {h\Q = - ± J2 [Vi-fa-*J-KG*)] 2 -\m% K , 



(14) 



where { is a penalization parameter controlling the balance 
between the goodness of fit and the complexity of the model 
[31,59], and the notation || ■ || is the norm in RKHS. The solution 
of h in Equation (14) is given in terms of the well-known 
representer theorem of Kimeldorf and Wahba [60] and Wahba 
[61] 



The ReLRT statistic is defined as 



h(G)=Y,a,K(G,G l ), 



(15) 



ReLRT„ = 2[sup^ e n L Re (X) - L Re {X = 0)] . 



(11) 



Using the similar reasoning for LRT„, it can be shown that 
ReLRT„ is equal to 

(12) 

/„« = 

SUp,Ufl<(H-p)l0g 



1 + 



N n (X) 



D„{k) 



-J2 log(l +^)-r(n-p)log(Y'P 0 Y) 

7= 1 



in distribution. 

By taking full advantage of the spectral representation used in 
Equations (8) and (12), Crainiceanu and Rupper [47] described a 



where a — [«i, a 2 > •••> a »] ls an unknown vector of parameters and 
TTis a reproducing kernel function [31,54]. 
We further rewrite h in the form of matrix as 



h{G) =Ka, 



(16) 



where K is an nxn kernel matrix with its elements being Mfli, Gj). 
Various kernel functions have been designed in genetic statistics 
[59,62], such as the linear kernel, the polynomial kernel, the 
Gaussian kernel, and the identify by state (IBS) kernel. The explicit 
forms for these kernels can be found in Wu et al. [18], Wu et al. 
[32], Liu et al. [31], Kwee et al. [29], and Liu et al. [30]. If a 
kernel is weighted, then it is called a weighted kernel. In the paper 
the scaled weight described in Section 2.2 is used. Additionally, 
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once the kernel function is chosen, we assume that K is known 
completely. Consequently, inference about h in model (14) 
immediately reduces to inference about a. 
Replacing h in Equation (14) with (16) yields 



1 



(17) 



- -_ v Y-/J 0 -X/?-K«) (Y-/J 0 -X/J-K a )--£||K a ||^. 

Following the results of Gianola et al. [54], Wahba [61], and 
Wahba [63], 



IIKal 



= a'Ka, 



(18) 



(19) 



Equation (17) is re-expressed as 
Ln K (h\0 

= - ( Y - p 0 - xp - Ka)'(Y - p 0 - Xp - Ka) - \ £a'Ka. 

From a Bayesian perspective [31,59], Equation (19) is the log- 
posterior distribution of p o , fi and a, thus can be described as the 
following hierarchical model 



Y\p 0 ,fi,a ~ N(p 0 + X/} + Ka,a 2 ) 



(20) 



where x — 1 /£. Since h = Ka, alternatively the hierarchical model is 
re-expressed as 



Y\p 0 ,p,h~N(p 0 +Xp + h,a 2 ), 
A~jV(0,TK), 
/W?ccl. 



(21) 



In the paper we use the hierarchical model (21) since it avoids the 
calculation of inverse matrix and therefore reduces the computa- 
tional cost. 

Based on the arguments described above, we can construct the 
relationship between the semi-parametric linear model (13) and 
the mixed effects model (1). That is, model (13) is equivalent to the 
following mixed effects model 



paper, LRT.M and ReLRT.M are used to indicate the LRT and 
ReLRT for the mixed effects model, LRT.K and ReLRT.K are 
used to indicate the LRT and ReLRT for the kernel machine 
learning, and LRT and ReLRT are used to indicate both the two 
types. 

Results 

Simulation datasets 

We generate genotypes based on the coalescent model for 
European population by using the package COSI [64]. A total of 
100 kb gene region is simulated. Randomly selected continuous 
30% subregions of the simulated genotypes are used. Variants with 
MAF less than 0.01 are defined as rare variants. Two covariates 
are considered, X\ is a standard normal variable and x 2 is a binary 
variable with rate 0.5, and mutually independent. The sample size 
n is 300, 400, and 500. 

For type I error simulations the phenotype is generated as 

j>~/y(1.0 + 0.5xi +0.5x2,1), 

and the number of runs is 2,000. In power simulations, 30% rare 
variants are causal variants, the effect size |y| is 0.3 | logioMAF | , 
which leads to a size of 1 .2 for MAF = 0.000 1 and a size of 0.6 for 
MAF = 0.01. Among the causal rare variants, 0%, 30% or 50% 
have negative effects, i.e., in these settings their effects are 
— 0.3 1 log 10 MAF| . For power simulations the phenotype is 
generated as 



y~jv(l.0+0.5*i +0.5x2+ YZ-itffi' 1 



where q is the number of chosen causal rare variants, g"s are the 
genotypes and y"s are the effect sizes given above. The number of 
runs is 1,000. The simulation characteristics under these 
specifications are displayed in Table 1. 

In the present paper, seven methods including burden test, 
SKAT, SKAT-O, LRT.K, ReLRT.K, LRT.M, and ReLRT.M 
are compared. The first three tests are performed in the package 
SKAT [18], and the LRT and ReLRT are performed in the 
package RLRsim[65]. In practice the weighted kernel has been 
empirically shown to be more powerful compared to its 
unweighted counterpart [18,29], thus here we only consider the 
former. For comparison only the weighted linear kernel is used 
since under this situation both the mixed effects models in (1) and 
(22) are well specified, and the burden test and the SKAT-O are 
only able to be performed on the linear kernel. 



Y = p 0 + Xp + Zh + e, 

(22) 

e~A(0,(7 2 L,). 

The differences between model (22) and model (1) mainly lie in 
two aspects: (I) here Z is an identify matrix of order n, while G in 
model (1) is of dimension nxm; (II) the unknown parameter h here 
is an n-dimensional vector with its covariance-variance matrix 
being tK, while in model (1) the unknown parameter y is an m- 
dimensional vector with its covariance-variance matrix being 
dicig{zwf), here the notation diag indicates a diagonal matrix. 

Therefore, all the theories for the LRT and ReLRT developed 
under the context of mixed effects model can be also applicable in 
the context of kernel machine learning. The test of variance 
component in model (22) can proceed similarly in model (1). To 
distinct these two types of approaches, in the reminder of the 



Type I error and power 

Table 2 displays the estimated Type I errors for all the tests. It 
can be seen from Table 2 that all the tests control the type I error 
correctiy at the given a level. Figure 1 shows the estimated 



Table 1. Simulation characteristics. 













Used rare 


Causal rare 


n 


Total SNPs 


Selected SNPs 


variants 


variants 


300 


417 


125 


41 


12 


400 


434 


130 


47 


14 


500 


447 


134 


51 


15 



doi:1 0.1 371 /journal.pone.0093355.t001 



PLOS ONE | www.plosone.org 



4 



March 2014 | Volume 9 | Issue 3 | e93355 



Rare Variants Detection with Kernel LRT 



50° o Negative 



30° o Negative 



0° a Negative 



1 


bji 2c 




SKAT 


n 


SKAT-0 


□ 


IRT.K 




RftLRTK 




LRT.M 




RfcLRT.M 



>:- 

"I 



J 



1 



300 400 500 
Sample size 




300 400 500 




300 ^00 



500 



r 




i 


SKAT 


□ 


SKAT-0 


□ 


L3T.K 


c 


ReLRT.K 


L_ 


■• 


■ 


ReLRT.M 



i 

a. 




1 



300 400 500 
Sample size 





500 



Figure 1. Estimated power for all the tests. The top panel is for a = 0.01 and the bottom panel is for a = 0.05. M% negative means that in these 
associated SNPs M% have effects -0.3|log 10 MAF| and the rest (100-M)% are 0.3|log 10 MAF|. 
doi:1 0.1 371 /journal. pone.0093355.g001 

Table 2. Estimated Type I error. 





n 


Burden 


SKAT 


SKAT-O 


LRT.K 


ReLRT.K 


LRT.M 


ReLRT.M 


a = 0.05 


300 


0.051 


0.042 


0.043 


0.047 


0.052 


0.041 


0.046 


400 


0.060 


0.044 


0.055 


0.050 


0.056 


0.048 


0.051 


500 


0.058 


0.043 


0.054 


0.042 


0.046 


0.040 


0.042 


a = 0.01 


300 


0.010 


0.008 


0.008 


0.011 


0.012 


0.010 


0.010 


400 


0.012 


0.008 


0.012 


0.010 


0.012 


0.010 


0.010 


500 


0.010 


0.008 


0.010 


0.010 


0.010 


0.009 


0.010 


doi:1 0.1 371 /journal.pone.0093355.t002 
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Table 3. Losses of the power for a = 0.05 . 



n 


burden 


SKAT 


SKAT-O 


LRT.K 


ReLRT.K 


LRT.M 


ReLRT.M 


30% Negative* 


300 


0.456 


0.100 


0.280 


0.098 


0.087 


0.099 


0.096 


400 


0.574 


0.089 


0.279 


0.086 


0.079 


0.081 


0.084 


500 


0.593 


0.059 


0.233 


0.058 


0.065 


0.052 


0.063 


$ 

Average 


0.541 


0.083 


0.264 


0.081 


0.077 


0.077 


0.081 


50% Negative* 


300 


0.537 


0.151 


0.356 


0.128 


0.132 


0.124 


0.127 


400 


0.667 


0.096 


0.313 


0.073 


0.066 


0.073 


0.071 


500 


0.692 


0.068 


0.247 


0.050 


0.052 


0.043 


0.053 


$ 

Average 1 


0.632 


0.105 


0.305 


0.084 


0.083 


0.080 


0.084 



& : The values are differences of power between the situation with none of the causal variants (i.e., 0%) being negative and the situation with 30% or 50% causal variants 
being negative. 

*: It means that 30% or 50% causal variants are negatively related to phenotype with effects — 0.3|log 10 MAF| and the rest 70% or 50% are positively related to 
phenotype with effects 0.3|log 10 MAF|. 
: The average is calculated across sample sizes. 
doi:1 0.1 371 /journal.pone.0093355.t003 



powers. Tables 3 and 4 present the losses of power for the 
situation that 30% or 50% causal rare variants have negative 
effects compared to the situation that none of the causal rare 
variants has negative effects. These values are obtained according 
to Figure 1 . The average values in Tables 3 and 4 are calculated 
across sample sizes. 

Some important observations from Figure 1, Tables 3 and 4 
are listed as follows. 

(I) When all the causal rare variants have the same direction 

of effect, the burden test and the SKAT-O are the most 
powerful, following by the LRT, ReLRT, and SKAT. 
These results are expected since both the burden test and 
the SKAT-O are designed especially for this situation. 

(If) When both positive and negative effects are present, all 
the tests suffer from power decrease. Under this situation, 
the LRT and ReLRT have the highest powers, and the 

Table 4. Losses of the power for a = 0.01 & . 



burden test suffers from the most reduction of power. For 
example, when a. — 0.05, n — 500 and all causal rare 
variants are in the same direction, for the burden test its 
power is 0.817, while its power decreases to 0.224 when 
30% causal rare variants have negative effects and 0.125 
when 50% causal rare variants have negative effects. The 
SKAT-O is no longer optimal and has a smaller power 
compared to the SKAT, suggesting that in practice using 
the SKAT rather than the SKAT-O may be safer since 
the situation that positive and negative effects occur 
simultaneously is more frequent than the situation that all 
the effects are in the same direction. Compared with the 
SKAT, the LRT and ReLRT reduce fewer powers, 
implying these two tests are relatively more robust to the 
mixture effects of rare variants. 



n 


burden 


SKAT 


SKAT-O 


LRT.K 


ReLRT.K 


LRT.M 


ReLRT.M 


30% Negative* 


300 


0.355 


0.080 


0.233 


0.072 


0.078 


0.068 


0.074 


400 


0.490 


0.092 


0.330 


0.081 


0.086 


0.083 


0.083 


500 


0.530 


0.052 


0.279 


0.039 


0.038 


0.038 


0.036 


$ 

Average 


0.458 


0.075 


0.281 


0.064 


0.067 


0.063 


0.064 


50% Negative* 


300 


0.393 


0.124 


0.301 


0.094 


0.094 


0.090 


0.091 


400 


0.545 


0.105 


0.373 


0.089 


0.092 


0.090 


0.087 


500 


0.592 


0.074 


0.310 


0.054 


0.065 


0.054 


0.056 


$ 

Average 


0.510 


0.101 


0.328 


0.079 


0.084 


0.078 


0.078 



& : The values are differences of power between the situation with none of the causal variants (i.e., 0%) being negative and the situation with 30% or 50% causal variants 
being negative. 

# : It means that 30% or 50% causal variants are negatively related to phenotype with effects — 0.3|logi 0 MAF| and the rest 70% or 50% are positively related to 
phenotype with effects 0.3|log 10 MAF|. 
: The average is calculated across sample sizes. 
doi:1 0.1 371 /joumal.pone.0093355.t004 
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Table 5. Characteristics of the used GAW1 7 data*. 





Gene 


Chr 


Total 


Rare 


Causal 


MAF 


Causal Effects 


HIF3A 


19 


21 


15 


3 


7.17x1(T 3 ~0.385 


0.174668, 0.51468, 0.265181 


FLT1 


13 


35 


25 


11 


7.17x1(T 3 ~0.291 


0.18047, 0.457361, 0.732566, 0.839669, 
0.38582, 0.549816, 0.623466, 0.653351, 
0.59670, 0.549214, 0.090586 


KDR 


4 


16 


14 


10 


7.17x1(T 3 ~0.165 


0.598271, 0.715613, 0.503025, 1.17194, 
0.149975, 0.610938, 0.318125, 0.312058, 
1.171940, 0.417977 



*: Chr indicates the chromosome, Total indicates the total number of SNPs contained in the gene, and Rare indicates the number of rare SNPs within the gene. 
doi:1 0.1 371/joumal.pone.0093355.t005 



(III) It can be seen that the LRT and ReLRT consistently 
outperform the SKAT regardless of the sample size and 
the proportion of the negative causal rare variants. 

(IV) The ReLRT always has a higher power than the LRT, 
which may stem from the fact that the ReLRT gives the 
unbiased estimator of variance component. 

(V) The LRT.K versus LRT.M and the ReLRT.K versus 
ReLRT.M behave comparably, but it is interesting that 
the ReLRT.K has a slightly larger power than the 
ReLRT.M, and the LRT.K also has a slightly larger 
power than the LRT.M. 

Application 

We apply these methods to the unrelated samples of the 
GAW17 [33,34]. The GAW17 data contains 24,487 SNPs across 
3,205 autosomal genes on 697 individuals, three covariates (age, 
sex and smoke), three quantitative traits (Ql, Q2 and Q4), and a 
binary trait. Most of the SNPs are rare with MAT ranging from 
0.07% to 25.8%, 74% have MAP less than 0.01 and 12.8% have 
MAF more than 0.05. This data was widely used on Genetic 
Analysis Workshop 17 to evaluate the newly developed methods 
for rare variant detection and compare to the existing ones. 

Here we choose the quantitative trait Ql, and select the SNPs 
within genes HIF3A, FLT1 and KDR. These selected genes are 
rather typical for our comparison of the methods. For HIF3A, 20% 
SNPs are causal rare variants with weak effects. For FLT1, 44% 
SNPs are causal rare variants with moderate effects. For KDR, 
7 1 .4% SNPs are causal rare variants with relatively strong effects. 
The characteristics of the selected data are depicted in Table 5 . 
More detailed information regarding GAW17 data can be found 
in Almasy et al. [34] . 

We use the weighted linear kernel and define the rare variant as 
those with MAF less than 0.01, so the SNPs with MAF greater 
than such cut point are not included in the analysis. The results are 



Table 6. Results of the used GAW17 data. 



listed in Table 6. The two types of LRT and ReLRT lead to the 
same results; to save space only one type is reported. 
Some interesting results are observed form Table 6. 

(I) Since all the causal rare SNPs within each gene are 
positively related to the phenotype Ql [34], the burden test 
and SKAT-O have the smallest p values compared to other 
methods. The LRT and ReLRT obtain smaller p values 
than SKAT, and the ReLRT always has smaller p values 
compared to LRT. 

(II) Due to the weak effects and small proportion of rare 
variants, the HIF3A cannot be discovered by all the 
methods; while the FLT1 and KDR are successfully 
detected. But here it is noted that the p value of SKAT 
(1.29 xlO" 3 ) for KDR is much larger than those of LRT 
and ReLRT (with scale of 10" 5 ). 

(III) The burden test, SKAT, and SKAT-O cannot give any 
evidence regarding the effect of the gene. For instance, 
FLT1 and KDR can be viewed as moderate and strong 
signals, respectively, but instead the former has a much 
smaller p value than the latter. This may show a mistaken 
impression that the FLT1 is more associated with the 
phenotype. Fortunately, the estimates of X provided by 
LRT and ReLRT display the distinction, that is, the value 
of X for KDR is larger than that for FLT1. From Table 6, it 
can be seen that the estimates of X correctly reveal the 
effect strength of different genes. Here the result empiri- 
cally documents that the LRT and ReLRT are preferred to 
the SKAT when comparing the contributions of various 
genes based on a set of rare variants. 

Discussion 

In this paper we have proposed the LRT and ReLRT to detect 
the rare variants associated with complex phenotypes from both 
the standard mixed effects model framework and the kernel 



p value "k 

Gene Burden SKAT SKAT-O LRT ReLRT LRT ReLRT 

HIF3A 0.262 0.483 0.420 0.388 0.387 <0.001 -C0.001 

FLT1 6.12x10~ 8 9.01 x10~ 7 1.03x10~ 9 6.28x10~ 7 5.44x10~ 7 0.750 0.748 

KDR 9.27 x10~ 7 1.29x10~ 3 2.78 x10~ 6 4.99 x10~ 5 4.83 x10~ 5 1.778 1.767 



doi:1 0.1 371/journal.pone.0093355.t006 
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machine learning context. In the latter, the original space of 
genotypes is mapped to another higher dimensional space by the 
kernel function. Such a space may be potentially infinite 
dimensional and is referred to as a feature space in the machine 
learning literature where the model can proceed linearly 
[55,57,66]. An important advantage of kernel methods is that 
we do not have to construct the feature space explicitly since all the 
analyses can be finished direcdy over the kernel [66]. In fact the 
kernel function itself is frequendy more efficient to compute than 
the map function or the inner product induced in H K [55,57]. 

By using the representer theorem, the connection between the 
kernel machine learning and the mixed effects model is well 
established from the Bayesian point of view. This connection 
provides a convenient way to examine the rare variants under the 
context of kernel methods using the LRT and ReLRT. We can 
find that the kernel is actually the covariance structure for the 
random effects h, so it can be thought to be the prior correlation 
among subjects. 

Our simulations have demonstrated that the performance of the 
LRT and ReLRT in the two contexts is comparable. However, it 
can be expected that the methods of LRT.K and ReLRT.K 
should be more flexible and attractive although only the linear 
kernel function is employed in the paper; but even then the 
LRT.K and ReLRT.K have displayed slighdy larger powers than 
the LRT.M and ReLRT.M. Extending the proposed LRT and 
ReLRT to other kernel functions needs no any additional efforts, 
but more applications in practice are required to further 
understand the behaviors of various kernels. The choice of a 
kernel function is dependent on which feature space is used to 
approximate h [30,31]. Liu et al. [31] showed that in a simulation 
example the Gaussian kernel performed the best compared to 
other competing kernels. 

In the paper, the exact finite sample distributions of LRT and 
ReLRT obtained by simulation are employed. One may attempt 
to use the 50:50 mixture distribution of Xo an d X\ [44-46], where 
Xq is a point probability mass at zero and x\ i- s a chi-square 
distribution with 1 degree of freedom. However, it has been 
displayed that this mixture distribution is conservative [37,47]. It is 
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